After introducing logistic regression yesterday, today we are of course going to implement it! As usual, the implementation is split into two parts: the first half in Python, and the second half a simple practice run in R. Without further ado, let's get started!
In this post we build a diabetes prediction model using a logistic regression classifier. First, download the Pima Indian Diabetes dataset from Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database), then read it in with pandas.
# import pandas
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load the dataset; header=0 replaces the file's own header row with col_names, keeping the columns numeric
pima = pd.read_csv("/content/diabetes.csv", header=0, names=col_names)
pima.head()
The output shows the first five rows of the dataset.
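Before modeling, it doesn't hurt to confirm the size of the dataset and how (im)balanced the two classes are. A quick optional check, using the column names defined above:
# dataset dimensions and class balance of the target column
print(pima.shape)
print(pima.label.value_counts())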
Next, we split the columns into two groups: the dependent variable (the target) and the independent variables (the features).
# split the dataset into features and the target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
Split into training and test sets. In scikit-learn, random_state plays roughly the same role as set.seed() (a random seed) in R.
# split X and y into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)
Build the model:
# import the logistic regression model
from sklearn.linear_model import LogisticRegression
# instantiate the model (default parameters, with a fixed random_state for reproducibility)
logreg = LogisticRegression(random_state=16)
# fit the model with data
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
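By default, predict() labels a patient as diabetic when the predicted probability exceeds 0.5. If you want to trade recall against precision (the R half below does this with a hand-picked cutoff), you can work with the probabilities directly; a small sketch with a purely illustrative cutoff of 0.4:
# predicted probability of the positive class (diabetes)
y_prob = logreg.predict_proba(X_test)[:, 1]
# apply a custom decision threshold; 0.4 here is just for illustration
y_pred_custom = (y_prob >= 0.4).astype(int)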
Build the confusion matrix:
# import the metrics
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
The output is:
array([[116,   9],
       [ 26,  41]])
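In scikit-learn's confusion matrix the rows are the actual classes and the columns the predicted classes, so the diagonal holds the correct predictions. As a quick check, you can unpack the four cells and recompute the accuracy by hand:
# unpack the 2x2 matrix: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cnf_matrix.ravel()
print(tn, fp, fn, tp, (tn + tp) / (tn + fp + fn + tp))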
Use a heat map to visualize the confusion matrix:
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
The output is the confusion matrix drawn as a heat map.
Let's look at the model's overall performance metrics (accuracy, precision, recall & F1-score):
from sklearn.metrics import classification_report
target_names = ['without diabetes', 'with diabetes']
print(classification_report(y_test, y_pred, target_names=target_names))
The output is the classification report.
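The R half below evaluates its model with AUC, so for a closer comparison you can also report ROC AUC on the Python side. A minimal sketch using the predicted probabilities (nothing here beyond what scikit-learn already provides):
from sklearn.metrics import roc_auc_score
# ROC AUC computed from the predicted probability of the positive class
y_prob = logreg.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_prob))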
That wraps up the Python half; now for a simple R version. First, read in the dataset:
data = read.csv("/Users/biaoyun/Documents/Ithome/diabetes.csv")
head(data)
The output shows the first six rows of the dataset.
Split into train & test sets:
library(caret)
set.seed(16)
trainIndex <- createDataPartition(data$Outcome, p=0.8, list=FALSE)
train_set <- data[trainIndex,]
test_set <- data[-trainIndex,]
Fit a cross-validated, L1-penalized logistic regression with glmnet, keeping the label column out of the predictor matrix:
library(glmnet)
NFOLDS = 10 # k-fold cross-validation
# build predictor matrices without the Outcome label
x_train = as.matrix(train_set[, setdiff(names(train_set), "Outcome")])
x_test = as.matrix(test_set[, setdiff(names(test_set), "Outcome")])
glmnet_classifier = cv.glmnet(x_train, y = train_set$Outcome,
                              family = 'binomial',
                              alpha = 1,
                              type.measure = "auc",
                              nfolds = NFOLDS,
                              thresh = 1e-3,
                              maxit = 1e3)
preds = predict(glmnet_classifier, x_test, type = 'response')[,1]
glmnet:::auc(test_set$Outcome, preds) # AUC (not accuracy) on the test set
The output is the AUC on the held-out test set.
Confusion matrix & Performance metrics
# convert predicted probabilities into class labels using a 0.37 cutoff
assigner <- function(prediction){
  pred_class = c()
  for (i in seq_along(prediction)){
    if (prediction[i] > 0.37){
      pred_class[i] <- 1
    } else {
      pred_class[i] <- 0
    }
  }
  return(pred_class)
}
confusionMatrix(as.factor(assigner(preds)), as.factor(test_set$Outcome))
The output is the confusion matrix together with accuracy, sensitivity, specificity, and the other metrics that caret reports.
Thanks for reading today XD See you tomorrow!